Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits
We derive an algorithm that achieves the optimal (within constants)
pseudo-regret in both adversarial and stochastic multi-armed bandits without
prior knowledge of the regime and time horizon. The algorithm is based on
online mirror descent (OMD) with Tsallis entropy regularization with power
$\alpha = 1/2$ and reduced-variance loss estimators. More generally, we define an
adversarial regime with a self-bounding constraint, which includes the
stochastic regime, the stochastically constrained adversarial regime (Wei and
Luo), and the stochastic regime with adversarial corruptions (Lykouris et al.) as special
cases, and show that the algorithm achieves a logarithmic regret guarantee in
this regime and in all of its special cases simultaneously with the adversarial
regret guarantee. The algorithm also achieves adversarial and stochastic
optimality in the utility-based dueling bandit setting. We provide empirical
evaluation of the algorithm demonstrating that it significantly outperforms
UCB1 and EXP3 in stochastic environments. We also provide examples of
adversarial environments, where UCB1 and Thompson Sampling exhibit almost
linear regret, whereas our algorithm suffers only logarithmic regret. To the
best of our knowledge, this is the first example demonstrating vulnerability of
Thompson Sampling in adversarial environments. Last, but not least, we present
a general stochastic analysis and a general adversarial analysis of OMD
algorithms with Tsallis entropy regularization for $\alpha \in [0,1]$ and explain
the reason why $\alpha = 1/2$ works best.
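To make the core update concrete, here is a minimal sketch of OMD with 1/2-Tsallis entropy regularization using plain importance-weighted loss estimators; the paper's reduced-variance estimator and exact learning-rate constants are omitted, so treat this as an illustration rather than the algorithm as analyzed.

```python
import numpy as np

def tsallis_weights(cum_loss, eta, iters=50):
    """OMD step with 1/2-Tsallis entropy: find the normalization x (by
    Newton's method) such that w_i = 4 / (eta * (cum_loss_i - x))^2 sums to one."""
    x = np.min(cum_loss) - 2.0 / eta  # start with sum(w) >= 1 and cum_loss - x > 0
    for _ in range(iters):
        w = 4.0 / (eta * (cum_loss - x)) ** 2
        x -= (np.sum(w) - 1.0) / (eta * np.sum(w ** 1.5))  # Newton step
    return 4.0 / (eta * (cum_loss - x)) ** 2

def tsallis_inf(get_loss, K, T, seed=0):
    """get_loss(t, arm) -> loss in [0, 1] is a stand-in for the environment."""
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(K)
    for t in range(1, T + 1):
        w = tsallis_weights(cum_loss, eta=2.0 / np.sqrt(t))  # anytime learning rate
        arm = rng.choice(K, p=w / w.sum())
        cum_loss[arm] += get_loss(t, arm) / w[arm]  # importance-weighted estimate
    return cum_loss
```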
Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic factored
bandits and matching (up to constants) upper and lower regret bounds for the
problem. Furthermore, we show that with a slight modification the proposed
algorithm can be applied to utility-based dueling bandits. We obtain an
improvement in the additive terms of the regret bound compared to
state-of-the-art algorithms (the additive terms are dominant up to time
horizons that are exponential in the number of arms).
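To illustrate the factored action structure only (this is not the paper's algorithm), the toy sketch below treats each action as a tuple from a Cartesian product of atomic actions and keeps per-dimension reward statistics; the dimension sizes are arbitrary placeholders.

```python
import random

# Toy factored action space: an action is one atomic action per dimension,
# e.g. 3 x 2 x 4 atomic actions give 24 composite actions.
ATOMIC = [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2", "c3"]]

def sample_action():
    """Sample a composite action from the Cartesian product."""
    return tuple(random.choice(dim) for dim in ATOMIC)

# Per-dimension reward statistics: one scalar bandit observation is shared
# across the atomic action chosen in every dimension.
stats = [{a: [0.0, 0] for a in dim} for dim in ATOMIC]

def update(action, reward):
    for d, atomic in enumerate(action):
        stats[d][atomic][0] += reward  # running sum
        stats[d][atomic][1] += 1       # count
```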
An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays
We propose a new algorithm for adversarial multi-armed bandits with
unrestricted delays. The algorithm is based on a novel hybrid regularizer
applied in the Follow the Regularized Leader (FTRL) framework. It achieves
$\mathcal{O}\big(\sqrt{kn} + \sqrt{D\log k}\big)$ regret guarantee, where $k$ is the
number of arms, $n$ is the number of rounds, and $D$ is the total delay. The
result matches the lower bound within constants and requires no prior knowledge
of $D$ or $n$. Additionally, we propose a refined tuning of the algorithm,
which achieves $\mathcal{O}\big(\sqrt{kn} + \min_{S}\big(|S| + \sqrt{D_{\bar{S}}\log k}\big)\big)$
regret guarantee, where $S$ is a set of rounds excluded from delay counting,
$\bar{S} = [n] \setminus S$ are the counted rounds, and $D_{\bar{S}}$ is the total
delay in the counted rounds. If the delays are highly unbalanced, the latter
regret guarantee can be significantly tighter than the former. The result
requires no advance knowledge of the delays and resolves an open problem of
Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime
and require no doubling, which resolves another open problem of Thune et al.
(2019).
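As a rough sketch of what an FTRL step with a hybrid regularizer can look like, the snippet below combines a Tsallis-entropy term with a negative-entropy term and solves the step with a generic solver; the paper's exact regularizer, learning-rate schedules, and skipping scheme are not reproduced, and the constants here are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_hybrid_step(cum_loss_est, eta, gamma):
    """Minimize <w, L_hat> - (2/eta) * sum_i sqrt(w_i) + (1/gamma) * sum_i w_i*log(w_i)
    over the simplex. The sqrt term is the (1/2-)Tsallis-entropy part; the
    w*log(w) term is the negative-entropy part of the hybrid regularizer."""
    K = len(cum_loss_est)

    def objective(w):
        w = np.clip(w, 1e-12, None)  # keep sqrt/log well defined
        return (w @ cum_loss_est
                - 2.0 / eta * np.sum(np.sqrt(w))
                + np.sum(w * np.log(w)) / gamma)

    res = minimize(
        objective,
        np.full(K, 1.0 / K),
        bounds=[(1e-12, 1.0)] * K,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return res.x
```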
Connections Between Mirror Descent, Thompson Sampling and the Information Ratio
The information-theoretic analysis by Russo and Van Roy (2014) in combination
with minimax duality has proved a powerful tool for the analysis of online
learning algorithms in full and partial information settings. In most
applications there is a tantalising similarity to the classical analysis based
on mirror descent. We make a formal connection, showing that the
information-theoretic bounds in most applications can be derived from existing
techniques for online convex optimisation. Besides this, for $k$-armed
adversarial bandits we provide an efficient algorithm with regret that matches
the best information-theoretic upper bound and improve best known regret
guarantees for online linear optimisation on $\ell_p$-balls and bandits with
graph feedback.
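For reference, recall the central quantity (following Russo and Van Roy, 2014): the information ratio compares squared expected instantaneous regret to the information gained about the optimal action $A^*$,

$$\Gamma_t = \frac{\big(\mathbb{E}_t[\Delta_t]\big)^2}{I_t\big(A^*; (A_t, Y_t)\big)}, \qquad \mathbb{E}[\mathrm{Reg}_n] \le \sqrt{\bar{\Gamma}\, n\, H(A^*)} \le \sqrt{\bar{\Gamma}\, n \log k},$$

where $\bar{\Gamma}$ is a uniform bound on the per-round information ratio and $H(A^*) \le \log k$ for $k$ actions.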
Bypassing the Simulator: Near-Optimal Adversarial Linear Contextual Bandits
We consider the adversarial linear contextual bandit problem, where the loss
vectors are selected fully adversarially and the per-round action set (i.e. the
context) is drawn from a fixed distribution. Existing methods for this problem
either require access to a simulator to generate free i.i.d. contexts, achieve
a sub-optimal regret no better than $\widetilde{\mathcal{O}}\big(T^{5/6}\big)$, or are
computationally inefficient. We greatly improve these results by achieving a
regret of $\widetilde{\mathcal{O}}\big(\sqrt{T}\big)$ without a simulator, while maintaining
computational efficiency when the action set in each round is small. In the
special case of sleeping bandits with adversarial loss and stochastic arm
availability, our result answers affirmatively the open question by Saha et al.
[2020] on whether there exists a polynomial-time algorithm with
$\mathrm{poly}(d)\sqrt{T}$ regret. Our approach naturally handles the case where the
loss is linear up to an additive misspecification error, and our regret shows
near-optimal dependence on the magnitude of the error.
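The interaction protocol described above can be sketched as follows; the dimension, the context distribution, and the uniform learner are placeholders for illustration, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 1000

def draw_action_set():
    """Per-round action set (the context), drawn i.i.d. from a fixed
    distribution -- here a placeholder: a few random feature vectors."""
    return rng.normal(size=(rng.integers(2, 6), d))

for t in range(T):
    A_t = draw_action_set()            # stochastic context
    theta_t = rng.normal(size=d)       # stand-in for an adversarial loss vector
    i = rng.integers(len(A_t))         # naive learner: uniform choice
    observed_loss = A_t[i] @ theta_t   # bandit feedback: chosen action only
```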
Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback
We study online reinforcement learning in linear Markov decision processes
with adversarial losses and bandit feedback, without prior knowledge on
transitions or access to simulators. We introduce two algorithms that achieve
improved regret performance compared to existing approaches. The first
algorithm, although computationally inefficient, ensures a regret of
$\widetilde{\mathcal{O}}\big(\sqrt{K}\big)$, where $K$ is the number of
episodes. This is the first result with the optimal $\sqrt{K}$ dependence in the
considered setting. The second algorithm, which is based on the policy
optimization framework, guarantees a regret of
$\widetilde{\mathcal{O}}\big(K^{3/4}\big)$ and is computationally
efficient. Both our results significantly improve over the state-of-the-art: a
computationally inefficient algorithm by Kong et al. [2023] with
$\widetilde{\mathcal{O}}\big(K^{4/5} + \mathrm{poly}(1/\lambda_{\min})\big)$ regret,
for some problem-dependent constant $\lambda_{\min}$ that can
be arbitrarily close to zero, and a computationally efficient algorithm by
Sherman et al. [2023b] with $\widetilde{\mathcal{O}}\big(K^{6/7}\big)$ regret.
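For context (a standard definition, not specific to this paper): in a linear MDP the transitions and losses are linear in a known feature map,

$$P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle, \qquad \ell_k(s, a) = \langle \phi(s, a), \theta_k \rangle,$$

where $\phi$ is known to the learner while $\mu$ and the adversarially chosen per-episode loss vectors $\theta_k$ are not; bandit feedback means only the losses along the visited trajectory are observed.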
Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously
We develop the first general semi-bandit algorithm that simultaneously
achieves $\mathcal{O}(\log T)$ regret for stochastic environments and
$\mathcal{O}\big(\sqrt{T}\big)$ regret for adversarial environments without knowledge
of the regime or the number of rounds $T$. The leading problem-dependent
constants of our bounds are not only optimal in some worst-case sense studied
previously, but also optimal for two concrete instances of semi-bandit
problems. Our algorithm and analysis extend the recent work of Zimmert &
Seldin (2019) for the special case of multi-armed bandits, but importantly
require a novel hybrid regularizer designed specifically for the semi-bandit problem.
Experimental results on synthetic data show that our algorithm indeed performs
well uniformly over different environments. We finally provide a preliminary
extension of our results to the full bandit feedback setting.
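To make the feedback model concrete: in semi-bandit problems the learner plays a subset of arms and observes the loss of every chosen coordinate, so losses can be importance-weighted coordinate-wise. A minimal sketch follows; the sampling scheme shown is a crude placeholder, and the paper's hybrid regularizer is not included.

```python
import numpy as np

def semi_bandit_round(w, losses, m, rng):
    """w: marginal play probabilities (intended to sum to m); losses: true
    per-arm losses. Plays m arms (via a crude stand-in for a proper
    dependent-rounding scheme matching the marginals) and returns
    coordinate-wise importance-weighted loss estimates."""
    K = len(w)
    chosen = np.argsort(rng.uniform(size=K) - w)[:m]  # placeholder sampler
    loss_est = np.zeros(K)
    loss_est[chosen] = losses[chosen] / w[chosen]     # unbiased per coordinate
    return chosen, loss_est
```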
Refined Regret for Adversarial MDPs with Linear Function Approximation
We consider learning in an adversarial Markov Decision Process (MDP) where
the loss functions can change arbitrarily over episodes and the state space
can be arbitrarily large. We assume that the Q-function of any policy is linear
in some known features, that is, a linear function approximation exists. The
best existing regret upper bound for this setting (Luo et al., 2021) is of
order $\widetilde{\mathcal{O}}\big(K^{2/3}\big)$ (omitting all other dependencies), given
access to a simulator. This paper provides two algorithms that improve the
regret to $\widetilde{\mathcal{O}}\big(\sqrt{K}\big)$ in the same setting. Our first
algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader
(FTRL) algorithm with the log-barrier regularizer. This analysis allows the
loss estimators to be arbitrarily negative and might be of independent
interest. Our second algorithm develops a magnitude-reduced loss estimator,
further removing the polynomial dependency on the number of actions in the
first algorithm and leading to the optimal regret bound (up to logarithmic
terms and dependency on the horizon). Moreover, we also extend the first
algorithm to simulator-free linear MDPs, which achieves
$\widetilde{\mathcal{O}}\big(K^{8/9}\big)$ regret and greatly improves over the best
existing bound $\widetilde{\mathcal{O}}\big(K^{14/15}\big)$. This algorithm relies on a
better alternative to the Matrix Geometric Resampling procedure by Neu &
Olkhovskaya (2020), which could again be of independent interest.
Comment: Accepted to ICML 2023.
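The log-barrier FTRL step mentioned above admits a simple one-dimensional characterization; the sketch below uses Newton's method on the normalization constant (constants are illustrative, not the paper's tuning). Note that the update stays well defined even for arbitrarily negative loss estimates, the property the refined analysis exploits.

```python
import numpy as np

def log_barrier_ftrl(cum_loss_est, eta, iters=50):
    """FTRL step with log-barrier regularizer -(1/eta) * sum_i log(w_i):
    the optimizer has the form w_i = 1 / (eta * (cum_loss_i + x)),
    with x set by Newton's method so the weights sum to one."""
    x = 1.0 / eta - np.min(cum_loss_est)  # start with sum(w) >= 1
    for _ in range(iters):
        w = 1.0 / (eta * (cum_loss_est + x))
        x += (np.sum(w) - 1.0) / (eta * np.sum(w ** 2))  # Newton step
    return 1.0 / (eta * (cum_loss_est + x))
```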
Model Selection in Contextual Stochastic Bandit Problems
We study model selection in stochastic bandit problems. Our approach relies
on a master algorithm that selects its actions among candidate base algorithms.
While this problem is studied for specific classes of stochastic base
algorithms, our objective is to provide a method that can work with more
general classes of stochastic base algorithms. We propose a master algorithm
inspired by CORRAL (Agarwal et al., 2017) and introduce a novel and
generic smoothing transformation for stochastic bandit algorithms that permits
us to obtain $\widetilde{\mathcal{O}}\big(\sqrt{T}\big)$ regret guarantees for a wide class of base
algorithms when working along with our master. We exhibit a lower bound showing
that even when one of the base algorithms has $\mathcal{O}(\log T)$ regret, in general it
is impossible to get better than $\Omega\big(\sqrt{T}\big)$ regret in model selection,
even asymptotically. We apply our algorithm to choose among different values of
$\epsilon$ for the $\epsilon$-greedy algorithm, and to choose between the
$k$-armed UCB and linear UCB algorithms. Our empirical studies further confirm
the effectiveness of our model-selection method.
Comment: 12 main pages, 2 figures, 14 appendix pages.
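As a concrete instance of the kind of base algorithm being selected over, here is a minimal $\epsilon$-greedy bandit; the master algorithm, the smoothing transformation, and the CORRAL-style updates from the paper are not shown.

```python
import numpy as np

class EpsGreedy:
    """Minimal epsilon-greedy base algorithm for a K-armed bandit."""
    def __init__(self, K, eps, seed=0):
        self.eps, self.rng = eps, np.random.default_rng(seed)
        self.sums, self.counts = np.zeros(K), np.zeros(K)

    def act(self):
        if self.rng.random() < self.eps or self.counts.min() == 0:
            return int(self.rng.integers(len(self.sums)))   # explore
        return int(np.argmax(self.sums / self.counts))      # exploit

    def update(self, arm, reward):
        self.sums[arm] += reward
        self.counts[arm] += 1

# Candidate base algorithms with different exploration rates; the paper's
# master would adaptively allocate rounds among these.
bases = [EpsGreedy(K=10, eps=e) for e in (0.01, 0.05, 0.1, 0.2)]
```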